ABSTRACT

The  aim  of  this  work  is  to  automatically  extract  structured  information  from  unstructured  texts,  permitting 
their  fusion  in  an  intelligence  application.  In  Thales,  we  have  a  knowledge  management  system  (Ideliance) 
that  permits  us  to  manage  entities  and  relations  between  them,  but  at  present  the  user  must  manually 
capture  this  information.  To  automate  such  an  extraction,  we  propose  the  use  of  a  learning  algorithm  that 
we have developed after studying the existing information extraction methods. We present the Sem+
tool, which implements the algorithm, and the evaluation of this tool carried out by us and by the Land
Headquarter (S.T.A.T. unit).


1  INTRODUCTION 

The  aim  of  this  work  is  to  automatically  extract  structured  information  from  unstructured  texts,  permitting 
their fusion in an intelligence application. Here, a piece of structured information consists of a relation
between  two  entities,  for  example  a  relation  of  sales  or  purchasing  between  two  companies,  or  a  relation  of 
location  between  a  person  and  a  place. 

In  Thales,  we  have  a  knowledge  management  system  (Ideliance)  that  permits  us  to  manage  entities  and 
relations  between  them,  but  at  present  the  user  must  manually  capture  this  information.  Our  aim  is  to  ease 
the  user  task  by  allowing  the  automatic  acquisition  of  such  knowledge  from  texts. 

To automate such an extraction, we propose the use of learning methods. Our work, presented briefly
below, has three parts: we studied existing methods and chose the most efficient one; we developed the Sem+
system, which implements the learning method to extract information; and we performed two
evaluations on different corpora to validate our approach.


2  INFORMATION  EXTRACTION  BEFORE  FUSION 

The  most  important  prerequisite  for  fusing  information  is  to  identify  and  extract  the  relevant  information. 
In  most  cases  this  step  is  done  manually.  In  this  paper  we  describe  a  method  to  automatically  extract 
relevant  information  from  knowledge  provided  by  the  user. 


2.1  Ideliance 

In Thales we have the Ideliance tool, a knowledge management system based on the concept of
semantic networks (Rohmer, 2002). The aim of its designers was to make such a tool easy to use for
users without specific notions of “knowledge representation”. The knowledge it handles has the format
of a triplet “subject / verb / complement”, as in “Peter / is from the category / Person”, “Peter / is working
for / Thales”, etc.

Goujon, B.; Frigiere, J. (2006) Extraction of Relations between Entities from Texts by Learning Methods. In Information Fusion for
Command Support (pp. 9-1 - 9-8). Meeting Proceedings RTO-MP-IST-055, Paper 9. Neuilly-sur-Seine, France: RTO. Available from:

The main limitation of this tool is knowledge capture, which at present must be done manually.
Automating the capture of all the relations would greatly improve the tool. To this end, we
have worked on automatic information extraction.
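
To make the triplet format concrete, here is a minimal Python sketch of how “subject / verb / complement” statements could be represented and parsed. The `Triplet` class and the `parse_triplet` function are illustrative names of ours; they are not part of Ideliance, whose internal representation is not described in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One 'subject / verb / complement' statement (hypothetical model)."""
    subject: str
    verb: str
    complement: str

    def __str__(self) -> str:
        return f"{self.subject} / {self.verb} / {self.complement}"


def parse_triplet(text: str) -> Triplet:
    """Split a 'subject / verb / complement' string into its three parts."""
    parts = [part.strip() for part in text.split("/")]
    if len(parts) != 3:
        raise ValueError(f"expected three '/'-separated parts, got {len(parts)}")
    return Triplet(*parts)


# The two examples quoted in the text, stored as a small knowledge base.
knowledge = {
    parse_triplet("Peter / is from the category / Person"),
    parse_triplet("Peter / is working for / Thales"),
}
```

Because the class is frozen, triplets are hashable and can be kept in a set, so capturing the same statement twice does not duplicate it.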


2.2  Information  extraction:  state  of  the  art 

Various  methods  are  proposed  to  automate  information  extraction  in  various  contexts.  We  have  identified 
three  main  approaches:  declarative  approaches,  statistical  approaches  and  supervised  learning  approaches. 
We present each approach below, with comments on its strengths and limits.


2.2.1  Declarative  approaches 

A recent work (Bouhafs-Hafsia, 2005) described all the French linguistic knowledge needed
to extract information in an intelligence context. The aim of this approach is to describe beforehand all
the knowledge that is necessary to identify precise information. This description of linguistic knowledge is
produced by a specialist in the area from the study of a corpus.


Eleven relations were defined: Negociation, CoLocation (two persons in the same location), Confrontation,
Communication, etc. Each of these relations is associated with a set of words expressing the relation:
“marchander”, “traiter”, “négocier” for Negociation; “retrouver”, “recevoir”, “contacter”, “rejoindre”,
“voir”, etc. for CoLocation¹. Most of these words are verbs. These words are grouped together and
associated with specific rules according to the contexts that may be encountered: we can have “X et Y ont
négocié ...”, “X a traité avec Y ...” for the Negociation verbs, “X a contacté Y”² for the CoLocation
verbs. The implementation of these rules is based on the contextual exploration principle (Desclés, 1997).
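
As an illustration of the declarative style, the sketch below hard-codes a few rules of this kind as Python regular expressions. It is not the contextual exploration implementation: the `RULES` table, the `extract` function and the crude `ENTITY` expression (a single capitalised word) are simplifying assumptions of ours, and only the three example contexts quoted above are covered.

```python
import re

# Assumption: an entity is a single capitalised word (real systems use tagging).
ENTITY = r"([A-Z]\w+)"

# One hand-written rule per context; each maps a surface form to a relation.
RULES = [
    ("Negociation", re.compile(rf"{ENTITY} et {ENTITY} ont négocié")),
    ("Negociation", re.compile(rf"{ENTITY} a traité avec {ENTITY}")),
    ("CoLocation", re.compile(rf"{ENTITY} a contacté {ENTITY}")),
]


def extract(sentence):
    """Return (relation, X, Y) for every rule that matches the sentence."""
    found = []
    for relation, rule in RULES:
        for match in rule.finditer(sentence):
            found.append((relation, match.group(1), match.group(2)))
    return found
```

Every new relation or phrasing needs another hand-written entry in `RULES`.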


The  strong  point  of  such  an  approach  is  the  efficiency  of  the  precise  rules  to  capture  most  of  the 
information  associated  with  the  defined  relations. 


Such a method has several weak points. First, it is expensive: each relation
must be defined for each language by a linguist, after a specific study of a representative corpus. Second,
if end users are interested in a relation that was not previously defined, they cannot extract anything:
they have to wait for the linguist to provide the complete description (words, rules) of the new
relation. Third, this approach may be unusable in a strategic intelligence domain, where end users
do not want anyone to observe and manipulate their confidential data.


2.2.2  Statistical  approaches 

At the opposite end are statistical approaches, which do not need any linguistic knowledge. The principle
is to apply statistical calculations to identify words that frequently occur together in texts, which is taken
to indicate that a relation exists between these words, according to the distributional hypothesis of Harris (1971).

Such an approach has the advantage of not needing predefined specific knowledge to be usable. Also, the
identified information is obtained without preconceptions.
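
A minimal sketch of the co-occurrence idea, in Python, is given here for illustration. It simply counts how often two known words appear in the same sentence; the `cooccurrences` function, the substring test and the toy sentences are our own assumptions, and real statistical methods use more elaborate association measures.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(sentences, vocabulary):
    """Count, for each pair of vocabulary words, the sentences containing both."""
    counts = Counter()
    for sentence in sentences:
        present = sorted(word for word in vocabulary if word in sentence)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Toy English sentences, invented for the illustration.
SENTENCES = [
    "Chirac phoned Gbagbo on Wednesday",
    "Gbagbo received Bedie",
    "Chirac met Gbagbo in Paris",
]
COUNTS = cooccurrences(SENTENCES, {"Chirac", "Gbagbo", "Bedie"})
```

The counts say that two names are linked, and how strongly, but not by which relation.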


¹ to haggle over, to deal, to negotiate, ...; to meet again, to welcome, to contact, to rejoin, to see

² “X and Y have negotiated ...”, “X has dealt with Y ...”, “X has contacted Y”
Unfortunately, the relations obtained are underspecified: if such an approach identifies a relation
between two persons, what precisely is that relation? Are they friends or enemies? Were they in the same
place? Are they members of a common group? With such methods, users have to analyse the resulting data in
order to identify the specific relation. In the end, this approach does not provide precise automatic relation extraction.

2.2.3  Supervised  learning  approaches 

Other approaches are based on learning methods to extract information. Most of these methods are used to
ease the tasks of terminologists or knowledge engineers (Aussenac-Gilles et al., 2000), for instance the construction of
dictionaries (Riloff, 1993). Some are based on predefined knowledge such as WordNet (Bagga,
1997) to automate the production of pertinent patterns for a relation, but such knowledge input is
not directly usable (depending on the domain, a word may have various meanings, so users may first have
to select pertinent words and meanings according to their needs).

We  have  studied  three  learning  methods  adapted  to  extract  information  from  texts,  based  on  linguistic 
approaches: 

• Rapier (Califf, Mooney 2003) learns rules or extraction patterns from an annotated learning
corpus. It is based on syntactic analysis, on an algorithm applying compression rules, and on
rule weighting. Rules are formalized in a way that is not suited to non-linguists (they use three filler
patterns containing semantic and part-of-speech tags).

• ExDisco (Yangarber 2000) identifies a set of relevant documents and a set of event patterns from
un-annotated texts, starting from a small set of weighted "seed patterns". These seed patterns are
expanded to identify most of the various forms of the language expressing each relation in the
documents. A limitation of this method is that it needs a large training corpus to
weight documents and patterns efficiently.

• Promethee (Morin 1999) incrementally learns a set of lexico-syntactic patterns expressing a
semantic relation from a set of terms linked by this relation. It is based on the Hearst algorithm,
which starts from a first set of couples verifying a relation of interest and constructs
the corresponding set of patterns expressing this relation. These patterns permit the identification
of new couples, which in turn yield new patterns, and so on. Patterns are expressed in regular expression
notation, such as “CN1 is? a NP2 company” (where CN represents a company name and NP a noun
phrase), which is not easy for non-specialists to manipulate.

2.3  Our  approach  for  the  strategic  intelligence 

We have chosen to exploit a learning method, which is a good compromise between the costly declarative
methods and the imprecise statistical methods. Our aim is to adapt a learning method to the specific
constraints of the intelligence domain: we cannot learn from large corpora (they do not exist for a new event)
nor from annotated corpora (annotation of confidential documents by linguists is not conceivable). For
the same reasons, we cannot use an adapted ontology developed by a linguist or a terminologist. Finally,
we cannot provide a method requiring linguistic knowledge to users who are specialists in intelligence, not in
language.

2.3.1  Learning  algorithm 

Our algorithm is based on the Promethee algorithm but adapted to the intelligence domain as previously
explained. It is applied to a text pre-tagged with entities. Here are the steps of this algorithm:

1. Selection by the user of a couple of entity categories concerned by the relevant relation;

2. Capture by the user of couples of entities verifying the relation;

3. Automatic retrieval of sentences containing these couples, with patterns that potentially describe
the relation;

4. Selection by the user of the sentence extracts expressing the relation, and automatic
transformation of these extracts into patterns with Intex;

5. Use of the patterns: retrieval of new couples. Back to step 2.

To simplify the reading of the corpus by the user and the extraction of the couples, the corpus is
first reduced according to the couple of entity categories concerned by the relation. At step 4,
if some words of the sentence are not meaningful for expressing the relation, the user may replace these
optional words with “**”.
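
The cycle formed by steps 2 to 5 can be sketched in Python as follows. This is a simplified illustration, not the Sem+ implementation: entities are matched as plain substrings, a pattern is reduced to the words found between the two entities, the user's selection in step 4 is simulated by a `validate` callback, and new couples are accepted automatically instead of being shown to the user. All function names are ours, and the sentence “eBay acquiert iBazar.” is invented to let the loop run a second round.

```python
def sentences_with(couple, sentences):
    """Step 3: sentences that mention both entities of a couple."""
    a, b = couple
    return [s for s in sentences if a in s and b in s]


def extract_between(sentence, couple):
    """Candidate extract: the words between the two entities (None if out of order)."""
    a, b = couple
    start, end = sentence.find(a), sentence.find(b)
    if start == -1 or end == -1 or start + len(a) > end:
        return None
    return sentence[start + len(a):end].strip()


def apply_pattern(middle, sentences, entities):
    """Step 5: couples of known entities linked by the pattern text."""
    return {
        (a, b)
        for s in sentences
        for a in entities
        for b in entities
        if a != b and f"{a} {middle} {b}" in s
    }


def learn(seed_couples, sentences, entities, validate, max_rounds=10):
    """Alternate pattern capture and couple retrieval until nothing new appears."""
    couples, patterns = set(seed_couples), set()
    for _ in range(max_rounds):
        candidates = {
            extract_between(s, c)
            for c in couples
            for s in sentences_with(c, sentences)
        }
        accepted = {p for p in candidates if p and validate(p)}  # step 4
        found = set()
        for p in patterns | accepted:
            found |= apply_pattern(p, sentences, entities)
        if accepted <= patterns and found <= couples:
            break  # no new pattern and no new couple
        patterns |= accepted
        couples |= found
    return couples, patterns


ENTITIES = ["Transiciel", "Gencom", "eBay", "iBazar",
            "Schneider Electric", "EmitTransformers"]
SENTENCES = [
    "Transiciel rachète Gencom.",
    "eBay rachète iBazar.",
    "eBay acquiert iBazar.",  # invented sentence, for illustration only
    "Schneider Electric acquiert EmitTransformers.",
]
```

Starting from the single couple (Transiciel, Gencom), the first round yields the pattern “rachète” and the couple (eBay, iBazar); that couple then yields “acquiert”, which in turn retrieves (Schneider Electric, EmitTransformers).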

2.3.2  Example 

To illustrate our approach, here is an example from a French corpus. From the following sentences: “... le
président Chirac avait téléphoné mercredi au président ivoirien Laurent Gbagbo ... Laurent Gbagbo a reçu
hier en sa résidence de Cocody, l’ancien président Henri Konan Bedie ...”³ users will obtain the following
relations: “Jacques Chirac - A CONTACTE - Laurent Gbagbo” and “Laurent Gbagbo - A RENCONTRE
- Henri Konan Bedie”⁴. These relations will then be merged into their knowledge management system. To
do so, the users create the first pattern “Jacques Chirac avait téléphoné ** à Laurent Gbagbo” from
the reduced, lemmatised corpus, with this additional information: “Jacques Chirac” is the agent, “A
CONTACTE” is the relation expressed by the pattern and “Laurent Gbagbo” is the patient. Applied to
another corpus containing “Jacques Chirac avait téléphoné le 17 juillet au président Kostunica ...”, this
algorithm will automatically produce the new relation “Jacques Chirac - A CONTACTE - Vojislav
Kostunica”.
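
The role of the “**” wildcard in this example can be sketched with a Python regular expression. Sem+ builds Intex transducers rather than regular expressions, so this is only an analogy. The names `compile_pattern`, `extract_relation` and `NAME` are ours, and we assume that tagging and lemmatisation have already normalised the new sentence so that “au président Kostunica” appears as “à Vojislav Kostunica”.

```python
import re

# Crude stand-in for a tagged Person entity: one or more capitalised words.
NAME = r"[A-Z]\w+(?:\s[A-Z]\w+)*"


def compile_pattern(extract, agent, patient):
    """Generalise a user-selected extract: entities become groups, '**' skips words."""
    marked = extract.replace(agent, "<AGENT>").replace(patient, "<PATIENT>")
    regex = ""
    for token in marked.split():
        if token == "**":
            regex += r"(?:\S+\s+)*?"  # lazily skip zero or more optional words
        elif token == "<AGENT>":
            regex += rf"(?P<agent>{NAME})\s+"
        elif token == "<PATIENT>":
            regex += rf"(?P<patient>{NAME})\s+"
        else:
            regex += re.escape(token) + r"\s+"
    if regex.endswith(r"\s+"):
        regex = regex[:-3]  # drop the separator added after the last token
    return re.compile(regex)


def extract_relation(pattern, relation, sentence):
    """Return (agent, relation, patient) if the pattern matches, else None."""
    match = pattern.search(sentence)
    if match is None:
        return None
    return (match.group("agent"), relation, match.group("patient"))


PATTERN = compile_pattern(
    "Jacques Chirac avait téléphoné ** à Laurent Gbagbo",
    agent="Jacques Chirac",
    patient="Laurent Gbagbo",
)
# Assumed normalised form of the second sentence of the example.
NEW_SENTENCE = "Jacques Chirac avait téléphoné le 17 juillet à Vojislav Kostunica"
```

Calling `extract_relation(PATTERN, "A CONTACTE", NEW_SENTENCE)` skips “le 17 juillet” through the wildcard and returns the agent, the relation label and the new patient.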

3  SEM+ 

We  present  here  Sem+,  the  tool  containing  our  learning  algorithm,  and  its  evaluation. 

3.1  First  implementation 

The first version of the Sem+ system was developed by Julia Frigiere in 2004 (Frigiere, 2004). It is
based on Intex, a linguistic development environment (Silberstein), and is written in Java and Perl.

Before using Sem+, users have to construct their own domain dictionaries, compiling their lists of
words for each category (one text file per entity category). Lemmas can be used to unify outputs.

Here are the steps for using Sem+ (see Fig. 1).

1. Selection of a corpus, and automatic tagging of the named entities described in the dictionaries
(for example, companies.dic contains Transiciel and Gencom).

2.  Reduction  of  the  corpus  according  to  a  couple  of  entity  categories  (Company  in  our  example). 

3. Learning step (the two sub-steps below can be repeated as often as needed):

a. Capture of the entity couples to initiate the learning approach / retrieval of new couples
obtained from the captured patterns.

b. Capture of the patterns associated with the couples. The capture is easy: it consists of copying and
pasting the sentence extract (pattern) containing the relation, such as “Transiciel rachète
Gencom”⁵. The user validates the agent, the patient, and the category of the relation
expressed in the extract. For each extract, a transducer is automatically created with Intex
to obtain, for example, ACHAT(Company1, Company2)⁶ from “Company1 rachète
Company2”.

4. Use of the pattern set to extract relations from a new corpus: “eBay rachète iBazar” => ACHAT(eBay,
iBazar).

5. Production of a file in the Ideliance format, in order to export these relations into the knowledge
management system.

³ “the president Chirac had phoned on Wednesday to the president of the Ivory Coast Laurent Gbagbo ... Laurent Gbagbo has
received yesterday, in his residence of Cocody, the previous president Henri Konan Bedie”.

⁴ “Jacques Chirac - HAS CONTACTED - Laurent Gbagbo” and “Laurent Gbagbo - HAS MET - Henri Konan Bedie”
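
Steps 1 and 2 of this procedure (dictionary-based tagging, then reduction of the corpus to a couple of entity categories) can be sketched in Python as below. This is an illustration under simplifying assumptions, not the Sem+ code: dictionaries are plain lists held in memory rather than files, entities are found by substring search, and the names `tag_entities`, `reduce_corpus` and `DICTIONARIES` are ours.

```python
from collections import Counter


def tag_entities(sentence, dictionaries):
    """Step 1: return (entity, category) pairs in order of appearance."""
    hits = []
    for category, names in dictionaries.items():
        for name in names:
            position = sentence.find(name)
            if position != -1:
                hits.append((position, name, category))
    return [(name, category) for _, name, category in sorted(hits)]


def reduce_corpus(sentences, dictionaries, category_a, category_b):
    """Step 2: keep sentences holding one entity of each requested category.

    If both categories are the same (e.g. Company / Company), two distinct
    entities of that category are required.
    """
    required = Counter([category_a, category_b])
    kept = []
    for sentence in sentences:
        tagged = tag_entities(sentence, dictionaries)
        present = Counter(category for _, category in tagged)
        if all(present[c] >= n for c, n in required.items()):
            kept.append(sentence)
    return kept


DICTIONARIES = {"Company": ["Transiciel", "Gencom", "eBay", "iBazar"]}
CORPUS = [
    "Transiciel rachète Gencom.",
    "eBay pourrait racheter iBazar.",
    "Le site américain de vente aux enchères offrirait 100 millions d'euros.",
]
REDUCED = reduce_corpus(CORPUS, DICTIONARIES, "Company", "Company")
```

Only the first two sentences remain in `REDUCED`, since the third names no company; the user therefore reads only sentences that may express a relation between two companies.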


[Screenshot of the Sem+ interface. One pane lists the sentences of the reduced corpus, each prefixed by its document and sentence number (for example “11801_3. Schneider Electric acquiert Emit Transformers.”, “11792_28. eBay pourrait racheter iBazar.”, “11791_40. TotalFinaElf rachète la part de Statoil dans Kachagan.”); another pane shows the full source documents. The control panel follows the numbered steps: (1) TEXTE: Ouvrir; (2) Réduire corpus selon: NPSOC / NPSOC; (3) couples liés par une relation, APPRENTISSAGE: début apprentissage, entrer fichier de couples, chercher patrons, chercher couples, fin apprentissage; (4) and (5) EXTRACTION: calculer relations, générer fichier Ideliance.]


Fig.  1:  Example  of  the  Sem+  interface. 

Sem+ and the underlying algorithm can be adapted to various languages (English, French, ...), as the patterns
are developed from the corpus. They also suit various users, who can define their own
entity categories and their own relations, and select each pattern expressing them, without any linguistic
knowledge.

3.2  Evaluation 

We  have  done  two  evaluations:  the  first  was  carried  out  on  the  initial  corpus,  which  dealt  with  sales  and 
purchasing  relations  between  companies,  and  the  second  was  performed  on  a  more  realistic  corpus  dealing 
with  the  Ivory  Coast  events,  with  seventeen  relations  between  four  entity  categories. 


⁵ “Transiciel buys out Gencom.”

⁶ PURCHASE(Company1, Company2)

3.2.1 First evaluation on the financial corpus

The first evaluation (Frigiere, 2004) was done on a financial corpus, to test the coverage of the patterns and the
efficiency of the cyclical learning algorithm. There was only one entity category, Company, and two
relations between companies: Sale and Purchasing.

On a sub-corpus containing 201 documents, Sem+ was used to capture 45 patterns. These patterns
permitted the identification of 59 relations between companies (out of the 119 of this corpus). The recall
was 41% and the precision was 98%. Several reasons explain the low recall. Firstly, only
sentences containing two entities are taken into account, although relations are sometimes expressed across two
sentences (with anaphora). Secondly, in the implemented version of Sem+, patterns began with an
entity and ended with the other entity, so patterns such as “l’achat de Arisem par Thales” could not be
exploited.

To evaluate the efficiency of the learning algorithm, 20 couples of companies verifying a relation of sale
or purchasing were identified in the sub-corpus. They permitted the capture by the user of 28 patterns.
These patterns were used to extract two other couples, supplying 4 new patterns. This result is
encouraging, as it illustrates how easily our method learns new patterns: starting from 20 couples, the user
captured 32 patterns.

3.2.2 Second evaluation on the Ivory Coast corpus

The aim of this evaluation was to validate the usability of such an approach by a user without specific
linguistic knowledge, and to test the method on another domain with many relations and entity
categories. This evaluation was done at the Land Headquarter (S.T.A.T. unit) by Officer Cadet
Cytermann (Cytermann, 2005). He used Sem+ on a corpus of journalistic articles related to the
Ivory Coast crisis. For this evaluation, no specific criteria were used to quantify the results.

After some technical adjustments and a first test, it was concluded that Sem+ was easy to use, and that it
saved time when coupled with the Ideliance system. It was also noted that Sem+
handled a large number of entity categories and relations well: for the evaluation, four entity categories
were used (Person, Organization, Location and Mean), and seventeen relations: Meeting(Person, Person),
Appointment(Person, Person), Support(Organization, Organization), Moving(Person, Location), etc.

On this corpus, the learning algorithm was not efficient, perhaps because of the small size of the acquisition
corpus (120 sentences). Also, there were many relations to identify, so the number of cases for each
relation was not large enough for an efficient learning approach. This highlights how essential the characteristics of the
initial corpus are in a learning approach: if the corpus is not large enough, it may not provide
enough relation patterns to automate the information extraction.

3.2.3 Entity tagging

The tagging step requires dictionaries associated with each entity category. For example, for the
financial corpus, we built a company name dictionary containing lemmas (e.g. British
Telecommunications is associated with the lemma British Telecom). We also use a dictionary containing the
potential initial sections of company names, defined by T. Poibeau (Poibeau, 2003): “le groupe
Thales”, “la société Thales”⁷ are then reduced to “Thales”.
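
The two normalisation mechanisms just described (lemmas and initial sections) can be sketched in Python as follows. The `LEMMAS` and `INITIAL_SECTIONS` tables contain only the examples quoted above, and the `normalise` function is a hypothetical helper of ours, not the Intex dictionary lookup used by Sem+.

```python
# Variant form -> lemma (only the example from the text).
LEMMAS = {"British Telecommunications": "British Telecom"}

# Potential initial sections of company names, in lower case with a trailing space.
INITIAL_SECTIONS = ["le groupe ", "la société "]


def normalise(mention):
    """Strip a known initial section, then map the name to its lemma if any."""
    for section in INITIAL_SECTIONS:
        if mention.lower().startswith(section):
            mention = mention[len(section):]
            break
    return LEMMAS.get(mention, mention)
```

With these tables, `normalise("le groupe Thales")` and `normalise("la société Thales")` both give `"Thales"`, so relations extracted from either mention attach to the same entity.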

For the evaluation on the Ivory Coast corpus, the user also defined a dictionary for each entity category.
But, as there was no dictionary to manage the potential initial sections, they were included in the
dictionary: “le président Jacques Chirac”⁸ was an entry of the person name dictionary. To improve this, we
have to manage efficiently these words that are part of the person description.

⁷ Thales group, Thales company

Another difficulty was encountered: France was an entry of the location dictionary, but in many
sentences it meant “the organisation of the country” rather than “the country”, as in “La France
accuse Gbagbo”⁹. The use of a location word to express either a location or the associated organisation must be
handled specifically in order to improve the results of our system.


4  CONCLUSION 

We have worked on the automatic extraction of relations from texts. To do so, we have defined a specific
algorithm and implemented it in the Sem+ tool. Sem+ is easy to use for users without linguistic
knowledge. It eases the reading of a corpus by reducing it according to the entity categories pertinent to a
specific relation. It also stores the patterns expressing the relations, so that they can be reused for the automatic
extraction of new relations.

As detailed previously, evaluations were carried out to assess the value of the approach for extracting
relations in an intelligence context, and to identify the improvements needed. First,
we want to simplify the use of the Sem+ tool. Thanks to the remarks made by Officer Cadet
Cytermann, we have listed some elements to improve: the entity dictionaries should be managed
from the Sem+ interface; pre-filling some information may ease the user task; etc. Secondly,
we want to improve the efficiency of the learning algorithm by improving the quality of the prior entity tagging
step, by broadening the patterns through the use of lemmas, by permitting patterns that begin with a
noun (“l’achat de Arisem par Thales”¹⁰), and by studying the characteristics of the acquisition corpus. As
we stated, users do not have large corpora on their intelligence subjects, but they can use a corpus on a
similar topic, which may contain the pertinent relations. The use of lemmas will require a
generic dictionary of the language: with the Intex system, we have such dictionaries for French and
English. For other languages, the use of lemmas will depend on the availability of an adequate
dictionary.

As  we  said  previously,  Sem+  automatically  provides  results  that  are  usable  by  the  Ideliance  system.  We 
are  also  working  with  another  team  in  THALES  Research  &  Technology  to  provide  results  for  a  specific 
fusion  module  (Laudy,  2005). 

We  recently  experimented  with  Sem+  in  an  audio  filtering  platform:  relation  patterns  were  used  to  extract 
relations  from  audio  transcriptions  of  English  audio  data.  This  experiment  showed  the  usability  of  Sem+ 
on  English  data,  and  a  new  efficient  way  to  exploit  the  patterns  defined  from  text  to  extract  information 
from  audio  data. 
• Information fusion need

■  Ideliance:  knowledge  management  tool  based  on  semantic  networks 

■  data  format:  triplet  “  subject  /  verb  /  complement  ” 

■  manual  addition  of  knowledge 

■  information  sources:  texts 

•  Improvement 

■  automate  the  extraction  of  information  from  texts 

■  automate  the  insertion  of  this  information  into  Ideliance  bases 
Motivation - Example


Alassane  Ouattara  will  arrive  in  Pretoria 


“  <Person>  Alassane  Ouattara  </Person>  <Move>  will 
arrive  </Move>  in  <Location>  Pretoria  </Location>  ” 


Move(Alassane  Ouattara,  Pretoria) 
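
As a small illustration of the last step of this example, the Python sketch below turns the tagged sentence into the relation `Move(Alassane Ouattara, Pretoria)`. The `to_relation` function and the assumption that relation tags are known in advance are ours.

```python
import re

# Matches <Tag> text </Tag> and captures the tag name and the enclosed text.
TAG = re.compile(r"<(\w+)>\s*(.*?)\s*</\1>")


def to_relation(tagged, relation_tags=("Move",)):
    """Build 'Relation(arg1, arg2)' from a sentence tagged with entities and a relation."""
    spans = TAG.findall(tagged)
    relation = next(tag for tag, _ in spans if tag in relation_tags)
    arguments = [text for tag, text in spans if tag not in relation_tags]
    return f"{relation}({', '.join(arguments)})"


TAGGED = ("<Person> Alassane Ouattara </Person> <Move> will arrive </Move> "
          "in <Location> Pretoria </Location>")
```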
Vocabulary


•  Relations,  entity  categories,  entities  :  we  want  to  identify  relations 
between  two  entities. 

■  Relations:  Purchase,  Contact,  Moving... 

■  Entity  categories:  Companies,  Persons,  Locations,  ... 

■  Entities:  Jacques  Chirac,  Thales,  Paris 

•  Relation  Patterns  =  linguistic  representation  of  a  relation 

■  “  Alassane  Ouattara  will  arrive  in  Pretoria  ” 

■  <Person>  will  arrive  in  <Location> 

■  <Person>  <to_arrive_in>  <Location> 

•  Semantic  dictionaries  :  they  contain  the  words,  their  lemmas  and  their 
semantic  category 

■ Alassane Ouattara,.Person

■ Mister Ouattara,Alassane Ouattara.Person

■ Pretoria,.Location
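
The dictionary entries above can be read with a few lines of Python. We assume the layout “form,lemma.Category”, where an empty lemma means that the form is its own lemma; this reading of the format and the `parse_entry` name are our assumptions.

```python
def parse_entry(line):
    """Split 'form,lemma.Category' into (form, lemma, category)."""
    form, rest = line.split(",", 1)
    lemma, category = rest.rsplit(".", 1)
    form, lemma, category = form.strip(), lemma.strip(), category.strip()
    # Assumption: an empty lemma field means the form is already the lemma.
    return form, lemma or form, category


ENTRIES = [
    parse_entry("Alassane Ouattara,.Person"),
    parse_entry("Mister Ouattara,Alassane Ouattara.Person"),
    parse_entry("Pretoria,.Location"),
]
```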
Information Extraction Methods


Declarative  approaches 

•  Principles 

■  Use  of  pre-defined  linguistic  knowledge 

■  Relations  are  pre-defined  (Bouhafs-Hafsia  2005):  Deal,  Negociation, 
Contact,  ... 

■  Specific  rules  and  knowledge 

•  Limits 

■  Relation  patterns  must  be  defined  by  linguists 

■  Each  new  pertinent  relation  has  to  be  described  by  linguists 

■  Few  intelligence  specialists  are  linguists  !! 


A.  Bouhafs-Hafsia  (2005),  Utilisation  de  la  methode  d’exploration  contextuelle  pour  une  extraction  d’informations  sur  le 
web  dediees  a  la  veille.  Realisation  du  systeme  informatique  JavaVeille.  PhD  Thesis,  Universite  Paris  4. 
Statistical approaches

•  Principles 

■ No need for any linguistic knowledge

■ Methods adapted to any domain

■ Calculations based on co-occurrences

• Limits

■ Resulting relations are underspecified

■ Reading the source documents cannot be avoided!
Supervised learning approaches

•  Some  methods 

■  Rapier  (Califf,  Mooney  2003):  relation  patterns  are  learned  from  annotated 
texts. 

■ ExDisco (Yangarber 2000): documents are weighted according to their
relevance, and relation patterns are weighted according to their occurrences
in documents.

■  Promethee  (Morin  1999):  from  a  first  set  of  couples  verifying  a  relation, 
construction  of  corresponding  linguistic  patterns  that  may  express  the 
relation,  and  use  of  the  patterns  to  identify  new  couples. 

•  Their  limits 

■  Rapier:  there  is  no  annotated  corpus  for  each  domain. 

■ ExDisco: it needs a large training corpus.

■  Promethee:  it  uses  regular  expression  notations,  not  easy  to  use  for  non 
linguists. 
Our Information Extraction Algorithm


1.  capture of couples that satisfy a relevant relation;

“  Jacques  Chirac  ”  /  “  Laurent  Gbagbo  ” 

2.  retrieval of the sentences containing those couples, which may therefore
contain patterns expressing the relation;

“  Jacques  Chirac  had  phoned  on  Wednesday  to  Laurent  Gbagbo  ” 

3.  capture  of  patterns  expressing  the  relations  in  those  sentences; 

“  Jacques  Chirac  had  phoned  **  to  Laurent  Gbagbo  ” 

Subject:  Jacques  Chirac,  Relation:  Contact,  Complement:  Laurent  Gbagbo 
<Person>1  had  phoned  **  to  <Person>2  =  Contact(<Person>1,  <Person>2) 

4.  application of the patterns: retrieval of new couples, then back to step 2.

“  Jacques  Chirac  had  phoned  on  the  17th  of  July  to  Vojislav  Kostunica  ” 

=>  “  Jacques  Chirac  ”  /  “  Vojislav  Kostunica  ” 

=>  “Jacques  Chirac  said  in  Paris  that  he  has  invited  Vojislav  Kostunica  ” 
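
The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Sem+ implementation: entities are recognized from a hypothetical PERSONS list, a pattern is simply the sequence of words linking two consecutive persons, and the generalize heuristic (keep the first two words and the last one, replace the rest by `**`) is a crude stand-in for the pattern capture of step 3. All function names are invented for the example.

```python
import re

# Hypothetical semantic dictionary of Person entities (an assumption for this sketch).
PERSONS = ["Jacques Chirac", "Laurent Gbagbo", "Vojislav Kostunica"]

def adjacent_pairs(sentence):
    """Yield (entity1, linking_words, entity2) for consecutive persons in a sentence."""
    hits = sorted(
        (m.start(), m.end(), name)
        for name in PERSONS
        for m in re.finditer(re.escape(name), sentence)
    )
    for (_, end1, a), (start2, _, b) in zip(hits, hits[1:]):
        yield a, sentence[end1:start2].split(), b

def generalize(words):
    """Crude stand-in for step 3: keep the first two and the last word, wildcard the rest."""
    if len(words) <= 3:
        return tuple(words)
    return (words[0], words[1], "**", words[-1])

def matches(pattern, words):
    """True if the linking words fit the pattern; '**' absorbs any run of words."""
    if "**" not in pattern:
        return tuple(words) == pattern
    i = pattern.index("**")
    head, tail = pattern[:i], pattern[i + 1:]
    return (len(words) >= len(head) + len(tail)
            and tuple(words[:len(head)]) == head
            and tuple(words[len(words) - len(tail):]) == tail)

def bootstrap(sentences, seed_couples, max_rounds=5):
    """Alternate pattern learning (steps 2-3) and pattern application (step 4)."""
    couples, patterns = set(seed_couples), set()
    for _ in range(max_rounds):
        new_patterns = {generalize(w) for s in sentences
                        for a, w, b in adjacent_pairs(s) if (a, b) in couples}
        new_couples = {(a, b) for s in sentences
                       for a, w, b in adjacent_pairs(s)
                       if any(matches(p, w) for p in new_patterns)}
        if new_patterns <= patterns and new_couples <= couples:
            break  # nothing new was learned: stop iterating
        patterns |= new_patterns
        couples |= new_couples
    return couples, patterns

corpus = [
    "Jacques Chirac had phoned on Wednesday to Laurent Gbagbo",
    "Jacques Chirac had phoned on the 17th of July to Vojislav Kostunica",
    "Jacques Chirac said in Paris that he has invited Vojislav Kostunica",
]
couples, patterns = bootstrap(corpus, {("Jacques Chirac", "Laurent Gbagbo")})
```

On this toy corpus the seed couple yields the pattern "had phoned ** to", which retrieves the couple Jacques Chirac / Vojislav Kostunica; the third sentence then contributes a second candidate pattern. Such candidates are only potential expressions of the relation, so a validation step by the user is assumed before they are reused.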
Sem+



•  Sem+: eases the identification of relation patterns and performs
information extraction. It was developed in Java (Frigiere).

•  First  Step:  Learning  stage 

■  Input:  learning  corpus  +  semantic  dictionaries 

■  Output:  relation  patterns 

•  Second  Step:  Information  Extraction 

■  Input:  corpus  +  semantic  dictionaries  +  relation  patterns 

■  Output:  Ideliance-like  relations 

•  Pattern  manager 

■  Intex (M. Silberztein): linguistic development environment based on
finite-state automata and transducer technologies
Sem+ - Processing Line
[Screenshots: the Sem+ "Extraction d'information" (Information Extraction) window, listing numbered sentences retrieved from a French-language news corpus on the Ivory Coast crisis, shown together with the text of the source articles.]
•  Efficiency of the learning algorithm

Corpus:  financial  corpus 

Relations:  sales  and  purchasing  relations  between  companies 

■  Input: 20 couples (Transiciel / In Dreach Consulting, Thales / L-3, ...)

■  First  result:  28  patterns  built 

■  Use  of  the  patterns:  2  new  couples 

■  Second  result:  4  more  patterns 

■  Result: efficient algorithm => 20 couples led to the capture of 32
patterns (28 + 4).

•  Efficiency  of  the  learned  patterns  on  a  new  corpus 

■  Precision: 100%

■  Recall:  26% 

■  Result: efficient patterns => about a quarter (26%) of the new relations were
automatically extracted
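
For reference, precision and recall can be computed as in the short Python sketch below. It assumes that relations are represented as tuples and that a hand-built reference set is available; the function name and the sample relations are invented for illustration and are not the evaluation data.

```python
def precision_recall(extracted, reference):
    """Precision and recall of extracted relations against a hand-built reference set."""
    extracted, reference = set(extracted), set(reference)
    correct = extracted & reference
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall

# Invented example: 4 relations in the reference, 1 extracted, and it is correct.
reference = {("Purchase", "A", "B"), ("Purchase", "C", "D"),
             ("Sale", "B", "E"), ("Sale", "D", "A")}
extracted = {("Purchase", "A", "B")}
precision, recall = precision_recall(extracted, reference)
```

A precision of 100% with a low recall, as reported above, means that every extracted relation was correct but that most relations present in the corpus were expressed by patterns that had not been learned.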
•  With the Land Headquarters, on the Ivory Coast crisis

■  Test  of  the  usability,  by  a  non-linguist 

■  Test  of  the  time  efficiency,  combined  with  Ideliance 

■  Test  of  the  use  with  several  relations  and  entity  categories 

•  Entity  categories:  Person,  Organization,  Location,  Mean 

•  Relations:  Meeting,  Appointment,  Support,  Moving,  ... 


■  Test  of  the  learning  algorithm 

•  not efficient, because of:

.  the small size of the acquisition corpus (120 sentences)

.  the large number of relations to identify
.  the few repetitions

•  The initial corpus must be large enough for the learning stage.
•  Learning method

■  no need for pre-defined patterns

■  no need for a linguist to construct relevant patterns

■  applicable to various domains, given a suitable acquisition corpus

•  Sem+  tool 

■  Pattern  Learning,  and  Information  Extraction 

■  To  combine  with  Ideliance 

•  New  Experiment 

■  Use of Sem+ in an audio filtering platform, to extract relations from
transcriptions of English audio data
Conclusion


•  Improvements 

■  Person tagging that handles the leading part of a name (title, function)

«  President  Jacques  Chirac  »  =  «  French  President  Chirac  »  =  «  Jacques  Chirac  » 

■  Distinction  between  a  Location  and  an  Organization 

«  France  accuses  Gbagbo  » 

■  Enlarge  patterns 

«  Alassane  Ouattara  will  arrive  in  Pretoria  »  =>  <to_arrive>  {in,  at} 

■  Characterize what makes a good acquisition corpus
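
As an illustration of the "enlarge patterns" improvement, the Python sketch below generalizes a pattern to all forms of a verb and to a set of prepositions. It is only a sketch under assumptions: the VERB_FORMS table is a hypothetical stand-in for a lemmatizer or a linguistic dictionary, entities are approximated by capitalized words, and the function name is invented.

```python
import re

# Hypothetical inflection table standing in for a lemmatizer or a linguistic dictionary.
VERB_FORMS = {"to_arrive": ["arrive", "arrives", "arrived", "arriving"]}

def compile_enlarged_pattern(lemma, prepositions):
    """Build a regex for: <subject> ... <any form of lemma> <one of prepositions> <place>."""
    forms = "|".join(VERB_FORMS[lemma])
    preps = "|".join(sorted(prepositions))
    return re.compile(
        r"(?P<subject>[A-Z]\w+(?: [A-Z]\w+)*) "   # capitalized words, e.g. a person name
        r"(?:\w+ )*?"                              # optional auxiliaries ("will", "has")
        rf"(?:{forms}) (?:{preps}) "               # enlarged verb and preposition slots
        r"(?P<place>[A-Z]\w+)"                     # capitalized word taken as the location
    )

pattern = compile_enlarged_pattern("to_arrive", {"in", "at"})
m1 = pattern.search("Alassane Ouattara will arrive in Pretoria")
m2 = pattern.search("Alassane Ouattara arrived at Pretoria")
```

Both sentences are matched by the single enlarged pattern, with "Alassane Ouattara" as subject and "Pretoria" as place, whereas a literal pattern learned from the first sentence would miss the second.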

•  Next  challenge:  integration  in  a  global  framework  for  cognitive  situation 
awareness  (Laudy,  Mattioli,  Museux  2005) 

■  Identification and tagging of other entities: « explosion », « 1 death », ...

■  Management of relations with a variable number of entities:

«  Laurent  Gbagbo  has  invited  on  Monday  M.  Mbeki  to  Cocody  »

■  Anaphora resolution via discourse fusion

«  Laurent  Gbagbo  ...  .  He  has  invited  M.  Mbeki  » 